In [3]:
import pandas as pd
from datetime import datetime
In [4]:
def parse(x):
    return datetime.strptime(x,"%m/%d/%Y")
df=pd.read_csv("https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/amazon_revenue_profit.csv",parse_dates=['Quarter'],date_parser=parse)
df.head()
Out[4]:
Quarter Revenue Net Income
0 2020-03-31 75452 2535
1 2019-12-31 87437 3268
2 2019-09-30 69981 2134
3 2019-06-30 63404 2625
4 2019-03-31 59700 3561
In [7]:
amazon_df=df.set_index("Quarter")
amazon_df.head()
Out[7]:
Revenue Net Income
Quarter
2020-03-31 75452 2535
2019-12-31 87437 3268
2019-09-30 69981 2134
2019-06-30 63404 2625
2019-03-31 59700 3561
In [9]:
import plotly.express as px
fig=px.line(df,x="Quarter",y="Revenue",title="Amazon Revenue Slider")
fig.show()
In [11]:
# Same plot, now with a range slider and 1Y/2Y/3Y range-selector buttons
fig=px.line(df,x="Quarter",y="Revenue",title="Amazon Revenue Slider")
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1,label="1Y",step="year",stepmode="backward"),
            dict(count=2,label="2Y",step="year",stepmode="backward"),
            dict(count=3,label="3Y",step="year",stepmode="backward"),
            dict(step="all")
            
        ])
    )
)

fig.show()

Looking at this graph: the seasonal peaks stay roughly constant until 2009; in 2010 the peak is slightly higher, then roughly constant again; the 2014 peak is higher still, and the peaks keep rising from there. From 2006 to 2010 the series is close to stationary, but after that it is clearly not stationary.

Now let's test whether the data is stationary or not.

Null hypothesis: the data is stationary

Alternative hypothesis: the data is not stationary

In [12]:
#kpss test 
from statsmodels.tsa.stattools import kpss

What does the KPSS test do?

It helps us determine whether a time series is stationary around a mean or around a linear trend.

In [13]:
tstest=kpss(amazon_df["Revenue"], regression="ct")
tstest
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\tsa\stattools.py:2018: InterpolationWarning:

The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is smaller than the p-value returned.


Out[13]:
(0.30665545975169556,
 0.01,
 4,
 {'10%': 0.119, '5%': 0.146, '2.5%': 0.176, '1%': 0.216})

Test statistic = 0.3066

0.3066 > 0.216, the critical value even at the 1% level (and the returned p-value of 0.01 is an upper bound, per the warning above), so the null hypothesis of stationarity is rejected.

So our data is not stationary.
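The decision rule can be written out explicitly. A small sketch, reusing the statistic and the critical-value dictionary returned by the KPSS call above (the helper function name is mine, not part of statsmodels):

```python
# Decision rule for KPSS: reject the null of stationarity when the
# test statistic exceeds the critical value at the chosen level.
def kpss_decision(stat, crit, level="1%"):
    return "not stationary" if stat > crit[level] else "stationary"

# Values copied from the kpss() output above
crit = {'10%': 0.119, '5%': 0.146, '2.5%': 0.176, '1%': 0.216}
print(kpss_decision(0.30665545975169556, crit))
```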

Next we use the statsmodels package to do a seasonal decomposition (because we have seasonality). The decomposition also forces us to choose between an additive and a multiplicative model: if the seasonal swings stay roughly constant in size as the level changes, an additive model fits; if the swings grow in proportion to the level, as they clearly do here, a multiplicative model fits.

In [15]:
import statsmodels.api as sm
res=sm.tsa.seasonal_decompose(amazon_df['Revenue'],model="multiplicative")
resplot=res.plot()

Second panel: there is an increasing trend. Third panel: the seasonal component, which we expected since we know the data is seasonal. Finally the residual, which is the error/noise term: roughly, what is left of the observed value after the trend and seasonal components are accounted for.

Now print the observed values (this is the actual data):

In [16]:
res.observed
Out[16]:
Quarter
2020-03-31    75452.0
2019-12-31    87437.0
2019-09-30    69981.0
2019-06-30    63404.0
2019-03-31    59700.0
               ...   
2006-03-31     2279.0
2005-12-31     2977.0
2005-09-30     1858.0
2005-06-30     1753.0
2005-03-31     1902.0
Name: Revenue, Length: 61, dtype: float64
In [17]:
print(res.trend)# print trend component
Quarter
2020-03-31          NaN
2019-12-31          NaN
2019-09-30    72099.500
2019-06-30    68248.750
2019-03-31    64691.375
                ...    
2006-03-31     2369.375
2005-12-31     2265.000
2005-09-30     2169.625
2005-06-30          NaN
2005-03-31          NaN
Name: trend, Length: 61, dtype: float64
In [18]:
print(res.seasonal)
Quarter
2020-03-31    0.941840
2019-12-31    1.289518
2019-09-30    0.894993
2019-06-30    0.873649
2019-03-31    0.941840
                ...   
2006-03-31    0.941840
2005-12-31    1.289518
2005-09-30    0.894993
2005-06-30    0.873649
2005-03-31    0.941840
Name: seasonal, Length: 61, dtype: float64
In [19]:
res.resid
Out[19]:
Quarter
2020-03-31         NaN
2019-12-31         NaN
2019-09-30    1.084496
2019-06-30    1.063372
2019-03-31    0.979831
                ...   
2006-03-31    1.021253
2005-12-31    1.019256
2005-09-30    0.956844
2005-06-30         NaN
2005-03-31         NaN
Name: resid, Length: 61, dtype: float64

Finally we have all three components: trend, seasonal and residual. If we multiply the three together, we get back the original observed value.

In [20]:
res.observed.iloc[2] # third observed value (positional index 2)
Out[20]:
69981.0
In [21]:
res.trend.iloc[2]*res.seasonal.iloc[2]*res.resid.iloc[2]
Out[21]:
69980.99999999999

Now why do we need the decomposition? AR and MA models perform better when the data is stationary, so it helps to detrend the data first.

For a multiplicative model, detrended value = observed value / trend value (for an additive model, observed value - trend value).

Plot the detrended values:

In [23]:
pd.DataFrame(res.observed/res.trend).plot()
Out[23]:
<AxesSubplot:xlabel='Quarter'>
The output shows that there is no trend left: the seasonality is still visible, but the data is now completely detrended (in the previous plot the data had an increasing trend). Now we can go ahead and fit a model.